8 research outputs found
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Visual information is central to conversation: body gestures and physical
behaviour, for example, contribute to meaning that transcends words alone. To
date, however, most neural conversational models are limited to just text. We
introduce CHAMPAGNE, a generative model of conversations that can account for
visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a
large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from
web videos: crucial to our data collection pipeline is a pretrained language
model that converts error-prone automatic transcripts to a cleaner dialogue
format while maintaining meaning. Human evaluation reveals that YTD-18M is more
sensible and specific than prior resources (MMDialog, 1M dialogues), while
maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE
learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it
achieves state-of-the-art results on four vision-language tasks focused on
real-world conversations. We release data, models, and code.
Comment: ICCV 2023, Project page: https://seungjuhan.me/champagn
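The abstract's key pipeline step is a pretrained language model that rewrites noisy automatic transcripts into dialogue form. A minimal sketch of that step, assuming a hypothetical text-to-text callable `lm` (the actual YTD-18M prompt and model are not specified in the abstract):

```python
def transcript_to_dialogue(transcript, lm):
    """Rewrite a noisy ASR transcript as clean speaker-turn dialogue.

    `lm` is a hypothetical text-to-text callable standing in for a real
    pretrained language model API; the prompt below is illustrative only.
    """
    prompt = (
        "Rewrite the following automatic transcript as a clean dialogue, "
        "fixing transcription errors while preserving the meaning. "
        "Format each turn as 'Speaker N: ...'.\n\n"
        f"Transcript: {transcript}\n\nDialogue:"
    )
    return lm(prompt)
```

Any model capable of instruction following could fill the `lm` slot; the point is that cleanup is done by prompting, not by rule-based post-processing.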
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Human values are crucial to human decision-making. Value pluralism is the
view that multiple correct values may be held in tension with one another
(e.g., when considering lying to a friend to protect their feelings, how does
one balance honesty with friendship?). As statistical learners, AI systems fit
to averages by default, washing out these potentially irreducible value
conflicts. To improve AI systems to better reflect value pluralism, the
first-order challenge is to explore the extent to which AI systems can model
pluralistic human values, rights, and duties as well as their interaction.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and
duties connected to 31k human-written situations. ValuePrism's contextualized
values are generated by GPT-4 and deemed high-quality by human annotators 91%
of the time. We conduct a large-scale study with annotators across diverse
social and demographic backgrounds to try to understand whose values are
represented.
With ValuePrism, we build Kaleido, an open, lightweight, and structured
language-based multi-task model that generates, explains, and assesses the
relevance and valence (i.e., support or oppose) of human values, rights, and
duties within a specific context. Humans prefer the sets of values output by
our system over the teacher GPT-4, finding them more accurate and with broader
coverage. In addition, we demonstrate that Kaleido can help explain variability
in human decision-making by outputting contrasting values. Finally, we show
that Kaleido's representations transfer to other philosophical frameworks and
datasets, confirming the benefit of an explicit, modular, and interpretable
approach to value pluralism. We hope that our work will serve as a step to
making more explicit the implicit values behind human decision-making and to
steering AI systems to make decisions that are more in accordance with them.
Self-Refine: Iterative Refinement with Self-Feedback
Like people, LLMs do not always generate the best output for a given
generation problem (e.g., summaries, answers, explanations) on their first
try. Just as
people then refine their text, we introduce SELF-REFINE, a framework for
similarly improving initial outputs from LLMs through iterative feedback and
refinement. The main idea is to generate an output using an LLM, then allow the
same model to provide multi-aspect feedback for its own output; finally, the
same model refines its previously generated output given its own feedback.
Unlike earlier work, our iterative refinement framework does not require
supervised training data or reinforcement learning, and works with a single
LLM. We experiment with 7 diverse tasks, ranging from review rewriting to math
reasoning, demonstrating that our approach outperforms direct generation. In
all tasks, outputs generated with SELF-REFINE are preferred by humans and by
automated metrics over those generated directly with GPT-3.5 and GPT-4,
improving by an absolute 20% on average across tasks.
Comment: Code, data, and demo at https://selfrefine.info
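The generate-feedback-refine loop described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical text-to-text callable `llm` in place of a real model API; the paper's actual prompts are task-specific:

```python
def self_refine(task, llm, max_iters=4, stop="DONE"):
    """Sketch of the SELF-REFINE loop: one model drafts an output,
    critiques its own draft, and revises it using that feedback.
    No supervised training data or RL is involved, and the same
    `llm` callable plays all three roles."""
    output = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        feedback = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Critique this answer, or reply {stop} if it needs no changes:"
        )
        if stop in feedback:
            break  # the model judges its own output good enough
        output = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Feedback: {feedback}\nRevised answer:"
        )
    return output
```

The stop condition and iteration cap are illustrative knobs; the key property is that generator, critic, and refiner are the same frozen model.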
Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning
Large language models excel at a variety of language tasks when prompted with
examples or instructions. Yet controlling these models through prompting alone
is limited. Tailoring language models through fine-tuning (e.g., via
reinforcement learning) can be effective, but it is expensive and requires
model access.
We propose Inference-time Policy Adapters (IPA), which efficiently tailors a
language model such as GPT-3 without fine-tuning it. IPA guides a large base
model during decoding time through a lightweight policy adaptor trained to
optimize an arbitrary user objective with reinforcement learning.
On five challenging text generation tasks, such as toxicity reduction and
open-domain generation, IPA consistently brings significant improvements over
off-the-shelf language models. It outperforms competitive baseline methods,
sometimes even including expensive fine-tuning. In particular, tailoring GPT-2
with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major
performance boost over GPT-3 (and sometimes even over GPT-4). Our promising
results highlight the potential of IPA as a lightweight alternative to
tailoring extreme-scale language models.
Faith and Fate: Limits of Transformers on Compositionality
Transformer large language models (LLMs) have sparked admiration for their
exceptional performance on tasks that demand intricate multi-step reasoning.
Yet, these models simultaneously show failures on surprisingly trivial
problems. This begs the question: Are these errors incidental, or do they
signal more substantial limitations? In an attempt to demystify Transformers,
we investigate the limits of these models across three representative
compositional tasks -- multi-digit multiplication, logic grid puzzles, and a
classic dynamic programming problem. These tasks require breaking problems down
into sub-steps and synthesizing these steps into a precise answer. We formulate
compositional tasks as computation graphs to systematically quantify the level
of complexity, and break down reasoning steps into intermediate sub-procedures.
Our empirical findings suggest that Transformers solve compositional tasks by
reducing multi-step compositional reasoning into linearized subgraph matching,
without necessarily developing systematic problem-solving skills. To round off
our empirical study, we provide theoretical arguments on abstract multi-step
reasoning problems that highlight how Transformers' performance will rapidly
decay with increased task complexity.
Comment: 10 pages + appendix (21 pages)
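The computation-graph framing can be made concrete for multi-digit multiplication. As a rough illustration (not the paper's exact graph construction), the sketch below decomposes long multiplication into digit-level sub-steps and records each one; the number of recorded steps is a crude proxy for graph size, which grows quickly with digit count:

```python
def multiply_with_trace(a, b):
    """Long multiplication decomposed into digit-level sub-steps.

    Each recorded step is one node of the task's computation graph;
    the step count is a simple proxy for compositional complexity.
    Illustrative sketch only, not the paper's exact construction."""
    steps = []
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    partials = []
    for i, db in enumerate(b_digits):
        partial = 0
        for j, da in enumerate(a_digits):
            prod = da * db
            steps.append(("mul", da, db, prod))   # one single-digit product
            partial += prod * 10 ** j
        partials.append(partial * 10 ** i)        # shift by place value
    total = 0
    for p in partials:
        steps.append(("add", total, p, total + p))  # one multi-digit addition
        total += p
    return total, steps
```

For an m-digit by n-digit product this records m*n single-digit multiplications plus n additions, so the graph a model must implicitly traverse grows multiplicatively with operand length.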
Evaluating Open-Domain Question Answering in the Era of Large Language Models
Lexical matching remains the de facto evaluation method for open-domain
question answering (QA). Unfortunately, lexical matching fails completely when
a plausible candidate answer does not appear in the list of gold answers, which
is increasingly the case as we shift from extractive to generative models. The
recent success of large language models (LLMs) for QA aggravates lexical
matching failures since candidate answers become longer, thereby making
matching with the gold answers even more challenging. Without accurate
evaluation, the true progress in open-domain QA remains unknown. In this paper,
we conduct a thorough analysis of various open-domain QA models, including
LLMs, by manually evaluating their answers on a subset of NQ-open, a popular
benchmark. Our assessments reveal that while the true performance of all models
is significantly underestimated, the performance of the InstructGPT (zero-shot)
LLM increases by nearly +60%, making it on par with existing top models, and
the InstructGPT (few-shot) model actually achieves a new state-of-the-art on
NQ-open. We also find that more than 50% of lexical matching failures are
attributed to semantically equivalent answers. We further demonstrate that
regex matching ranks QA models consistent with human judgments, although still
suffering from unnecessary strictness. Finally, we demonstrate that automated
evaluation models are a reasonable surrogate for lexical matching in some
circumstances, but not for long-form answers generated by LLMs. The automated
models struggle in detecting hallucinations in LLM answers and are thus unable
to evaluate LLMs. At this time, there appears to be no substitute for human
evaluation.
Comment: ACL 2023; code and data released at https://github.com/ehsk/OpenQA-eva
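The lexical-matching failure mode described above is easy to reproduce with the standard SQuAD-style exact-match metric. This is a common implementation pattern; the benchmark's exact normalization may differ slightly:

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop punctuation
    and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(candidate, gold_answers):
    """True iff the normalized candidate equals any normalized gold answer."""
    return normalize(candidate) in {normalize(g) for g in gold_answers}
```

Here `exact_match("the Beatles", ["Beatles"])` is true, but a correct long-form answer like "John, Paul, George and Ringo" scores zero against the gold answer "The Beatles": exactly the failure mode that makes lexical matching underestimate generative LLMs.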